Introduction

In this exam, our aim is to forecast hourly solar power production. Weather measurements are available for four grid points (coordinates) near the power plant. The weather measurements are:

TEMP: Temperature at the provided location.

DSWRF: Short for downward shortwave radiation flux, which is known to be highly related to the production level.

CLOUD_LOW_LAYER: Total cloud cover (as a percentage) for low-level clouds.

We will make hourly predictions for December 2020 and January 2021. The methods we use are time series decomposition and linear regression, so this assignment includes two different approaches to the forecasting problem.

To start with, I check the distribution of hourly solar energy production during the day.

In our solar energy production data, production is zero at night for roughly 7-8 hours because of the lack of daylight, which makes hourly prediction difficult. I therefore decided to work with daily data by taking the mean of the hourly production by date. Temperature, cloud cover and downward shortwave radiation flux are used as explanatory variables for the production amount.
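The report's R code is not shown; the daily aggregation step can be sketched in Python as follows (hypothetical function and variable names):

```python
from collections import defaultdict

def hourly_to_daily_mean(records):
    """Average hourly production into one value per date.

    records: iterable of (date_str, production) pairs, e.g. ("2020-12-01", 3.2).
    Returns {date_str: mean_production}, including the zero night-time hours
    in the mean, as the report does.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for date, value in records:
        sums[date] += value
        counts[date] += 1
    return {d: sums[d] / counts[d] for d in sums}

hourly = [("2020-12-01", 0.0), ("2020-12-01", 4.0), ("2020-12-02", 2.0)]
print(hourly_to_daily_mean(hourly))  # {'2020-12-01': 2.0, '2020-12-02': 2.0}
```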

Below, you can see our daily solar power plant data.

When we examine the graph, we can say that the amount of solar energy generation shows seasonality. The mean of the series is not constant and its variance changes over time, so the solar energy production series is not stationary in either mean or variance.

Looking at the seasonality in detail, the mean of solar energy production increases in the summer months and decreases in the winter months. This can be attributed to the longer days in summer and the additional hours of sunlight.

There are strong autocorrelations at every lag: even in the long run, one day's production is significantly correlated with other days'. Taking a moving average can be a good way to stabilize the data.

In the partial autocorrelation graph, lag 1 and lag 25 are significantly high. The high PACF at lag 1 is evidence that an AR(1) model can be applied to make the data stationary.

Forecasting with time series analysis

Time series objects make decomposition straightforward. To start with, I construct the time series object with frequency 7 in order to examine the weekly seasonality of the solar production data. I select the training range from 9 October 2019 to 30 November 2020. In this method, the main aim is to reach stationarity.

Below, you can see the daily mean solar power energy production data with frequency 7.

The variance is not stable over the weeks and fluctuates a lot. In order to stabilize it, I try taking the log of the series.

Above, you can see the graph of the log-transformed series. The variance looks much the same as before, so the log transform does not help and I continue with the untransformed series.

I can now start decomposing the series.

Because the seasonal fluctuations scale with the level of the series, I select the multiplicative decomposition method.

The variance seems more stable and the mean of the remainder is around 1, which is expected for a multiplicative decomposition. Let's check the ACF and PACF.
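The multiplicative split y_t = T_t × S_t × R_t used here (as in R's decompose with type = "multiplicative") can be sketched in Python; function names are hypothetical:

```python
def centered_ma(y, period=7):
    """Trend estimate T_t: centered moving average over one seasonal cycle."""
    half = period // 2
    return [sum(y[i - half:i + half + 1]) / period
            if half <= i < len(y) - half else None
            for i in range(len(y))]

def multiplicative_remainder(y, period=7):
    """Remainder R_t of a classical multiplicative decomposition.

    Seasonal indices S are per-position averages of y/T, normalized so they
    average to 1 over the cycle; the remainder is y / (T * S).
    """
    trend = centered_ma(y, period)
    ratios = [[] for _ in range(period)]
    for i, (yt, tt) in enumerate(zip(y, trend)):
        if tt is not None:
            ratios[i % period].append(yt / tt)
    seasonal = [sum(r) / len(r) for r in ratios]
    norm = sum(seasonal) / period
    seasonal = [s / norm for s in seasonal]  # mean 1 over the weekly cycle
    return [yt / (tt * seasonal[i % period]) if tt is not None else None
            for i, (yt, tt) in enumerate(zip(y, trend))]
```

For a purely weekly-seasonal series the remainder comes out as 1 everywhere the trend is defined, which is why a remainder hovering around 1 indicates a good multiplicative fit.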

Looking at the ACF graph, the autocorrelations are not high except at lag 1. The drop at lag 2 after the high correlation at lag 1 indicates that an AR(1) model may be appropriate.

Then, to check the stationarity of the random (remainder) component from the decomposition, the KPSS test is used.

## 
## ####################### 
## # KPSS Unit Root Test # 
## ####################### 
## 
## Test is of type: mu with 5 lags. 
## 
## Value of test-statistic is: 0.0233 
## 
## Critical value for a significance level of: 
##                 10pct  5pct 2.5pct  1pct
## critical values 0.347 0.463  0.574 0.739

The KPSS test statistic is 0.0233, well below even the 10% critical value (0.0233 < 0.347). Since the KPSS null hypothesis is stationarity, we fail to reject it at every listed significance level and conclude that the series is stationary.
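Because the KPSS null hypothesis is stationarity, the decision rule is just a comparison against the tabulated critical values. A small Python sketch, with the critical values copied from the test output above:

```python
# Critical values printed by the KPSS test: 10% -> 0.347, 5% -> 0.463,
# 2.5% -> 0.574, 1% -> 0.739.
kpss_critical = {0.10: 0.347, 0.05: 0.463, 0.025: 0.574, 0.01: 0.739}

def kpss_stationary(statistic, level=0.05):
    """KPSS's null hypothesis IS stationarity: a statistic below the critical
    value means we fail to reject, i.e. we treat the series as stationary."""
    return statistic < kpss_critical[level]

print(kpss_stationary(0.0233, 0.10))  # True: stationary even at the 10% level
```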

I continue with ARIMA model selection for the random component, using the auto.arima function.

MODEL 1- ARIMA(2,0,3):

## 
##  Fitting models using approximations to speed things up...
## 
##  ARIMA(2,0,2)           with non-zero mean : 82.22218
##  ARIMA(0,0,0)           with non-zero mean : 212.9822
##  ARIMA(1,0,0)           with non-zero mean : 215.6135
##  ARIMA(0,0,1)           with non-zero mean : 214.3141
##  ARIMA(0,0,0)           with zero mean     : 1209.848
##  ARIMA(1,0,2)           with non-zero mean : 109.0274
##  ARIMA(2,0,1)           with non-zero mean : 84.08502
##  ARIMA(3,0,2)           with non-zero mean : 78.27078
##  ARIMA(3,0,1)           with non-zero mean : 80.60108
##  ARIMA(4,0,2)           with non-zero mean : 79.8118
##  ARIMA(3,0,3)           with non-zero mean : 78.51367
##  ARIMA(2,0,3)           with non-zero mean : 75.78551
##  ARIMA(1,0,3)           with non-zero mean : 89.47641
##  ARIMA(2,0,4)           with non-zero mean : 77.35533
##  ARIMA(1,0,4)           with non-zero mean : 74.93015
##  ARIMA(0,0,4)           with non-zero mean : 85.96573
##  ARIMA(1,0,5)           with non-zero mean : 75.99302
##  ARIMA(0,0,3)           with non-zero mean : 89.59489
##  ARIMA(0,0,5)           with non-zero mean : 78.66864
##  ARIMA(2,0,5)           with non-zero mean : 78.68772
##  ARIMA(1,0,4)           with zero mean     : Inf
## 
##  Now re-fitting the best model(s) without approximations...
## 
##  ARIMA(1,0,4)           with non-zero mean : Inf
##  ARIMA(2,0,3)           with non-zero mean : 75.37309
## 
##  Best model: ARIMA(2,0,3)           with non-zero mean
## Series: data_random 
## ARIMA(2,0,3) with non-zero mean 
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3    mean
##       0.4733  -0.5006  -0.6982  0.1990  -0.3075  0.9970
## s.e.  0.1550   0.0786   0.1591  0.1049   0.0908  0.0025
## 
## sigma^2 estimated as 0.06866:  log likelihood=-30.55
## AIC=75.1   AICc=75.37   BIC=103.26

The best model from auto.arima is ARIMA(2,0,3), i.e. autoregressive order 2 and moving average order 3. Its AIC and BIC values are shown below.

## [1] 75.09654
## [1] 103.2607

I decided to try different ARIMA specifications to check the result from auto.arima and to look for lower AIC and BIC values. I chose ARIMA(1,0,4), ARIMA(1,0,3), ARIMA(2,0,2) and ARIMA(2,0,4) as test models, which are similar to ARIMA(2,0,3).

MODEL 2- ARIMA(1,0,4):

## 
## Call:
## arima(x = data_random, order = c(1, 0, 4))
## 
## Coefficients:
##          ar1      ma1      ma2      ma3     ma4  intercept
##       0.4874  -0.7317  -0.3124  -0.1343  0.2724     0.9970
## s.e.     NaN      NaN      NaN      NaN     NaN     0.0024
## 
## sigma^2 estimated as 0.06855:  log likelihood = -33.25,  aic = 80.5
## [1] 80.50092
## [1] 108.6651

MODEL 3- ARIMA(1,0,3):

## 
## Call:
## arima(x = data_random, order = c(1, 0, 3))
## 
## Coefficients:
##           ar1     ma1      ma2      ma3  intercept
##       -0.2644  0.0645  -0.4300  -0.4108     0.9969
## s.e.   0.1769  0.1651   0.0693   0.0789     0.0024
## 
## sigma^2 estimated as 0.07039:  log likelihood = -38.61,  aic = 89.22
## [1] 89.22221
## [1] 113.3629

MODEL 4- ARIMA(2,0,2):

## 
## Call:
## arima(x = data_random, order = c(2, 0, 2))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept
##       0.8397  -0.4813  -1.0707  0.1950     0.9970
## s.e.  0.0980   0.0627   0.1076  0.0985     0.0025
## 
## sigma^2 estimated as 0.06908:  log likelihood = -34.81,  aic = 81.61
## [1] 81.61478
## [1] 105.7555

MODEL 5- ARIMA(2,0,4):

## 
## Call:
## arima(x = data_random, order = c(2, 0, 4))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2      ma3     ma4  intercept
##       0.5680  -0.4417  -0.8002  0.1642  -0.2723  0.0755     0.9970
## s.e.  0.2167   0.1450   0.2256  0.1361   0.1210  0.1336     0.0025
## 
## sigma^2 estimated as 0.06759:  log likelihood = -30.33,  aic = 76.66
## [1] 76.65744
## [1] 108.845

Across the different ARIMA models tried, the AIC and BIC values are very similar. I keep ARIMA(2,0,3), which has the lowest AIC and BIC.
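The comparison can be reproduced from the printed log-likelihoods via AIC = 2k − 2·logL, where k counts the AR and MA coefficients plus the mean/intercept and sigma². A Python sketch using the values from the outputs above:

```python
def aic(log_lik, k):
    """AIC = 2k - 2*logL; k = AR + MA coefficients + mean + sigma^2."""
    return 2 * k - 2 * log_lik

# (log-likelihood, k) pairs read off the model outputs above.
candidates = {
    "ARIMA(2,0,3)": (-30.55, 7),
    "ARIMA(1,0,4)": (-33.25, 7),
    "ARIMA(1,0,3)": (-38.61, 6),
    "ARIMA(2,0,2)": (-34.81, 6),
    "ARIMA(2,0,4)": (-30.33, 8),
}
aics = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(aics, key=aics.get)
print(best, round(aics[best], 2))  # ARIMA(2,0,3) 75.1
```

The recomputed values match the printed ones (e.g. 80.5 for ARIMA(1,0,4), 76.66 for ARIMA(2,0,4)), confirming ARIMA(2,0,3) as the minimum-AIC choice.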

The residuals from the ARIMA(2,0,3) model have a constant mean. There are significant autocorrelations at lags 9 and 11, but in general the residual autocorrelations are fine.

I can continue with the forecasting part using the ARIMA(2,0,3) model. I use the data up to December 1, 2020 as the training set and estimate December 2020 and January 2021.

I predicted each day using the training data available up to the previous day, then collected the forecasted and actual daily values in a table. Below, you can see the graph of actual vs. predicted solar energy production from 2020-12-01 to 2021-01-31.

I had difficulty adding the trend and seasonality components back onto the ARIMA forecasts. For this reason I could not create a very good model: the actual and predicted values are not close enough.

Later, I will test the performance of forecasting with time series analysis.

Forecasting with regression

To begin the regression analysis, I check the correlations between solar energy production and the different measurements (temperature, cloud cover and downward shortwave radiation flux) from the four coordinates. I use the mean of the cloud layer variables, the maximum and minimum of the temperature variables, and the mean of the downward shortwave radiation flux variables.

According to the correlation matrix and correlation coefficients, the average minimum DSWRF, the average minimum temperature and the mean cloud cover are chosen as regressors.

Regression model 1 is formed.

MODEL 1:

## 
## Call:
## lm(formula = Production ~ avgmindswrf + tavg + meancloud, data = regdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5810 -0.5332  0.1469  0.8669  2.9815 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.7140662  0.1652293  22.478   <2e-16 ***
## avgmindswrf  0.0160728  0.0008977  17.905   <2e-16 ***
## tavg        -0.5076511  0.4476387  -1.134    0.257    
## meancloud   -1.5950636  0.1134868 -14.055   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.211 on 479 degrees of freedom
## Multiple R-squared:  0.7619, Adjusted R-squared:  0.7604 
## F-statistic:   511 on 3 and 479 DF,  p-value: < 2.2e-16

When I check the residuals from regression model 1, lag 1 is very high in the ACF graph. To improve the model, I add the previous day's production, shifted by one day, as a lag1 regressor.
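Building the lagged regressor amounts to shifting the production series by one day; the first observation then has no lag value, which is why the later model outputs report "(1 observation deleted due to missingness)". A Python sketch with a hypothetical helper name:

```python
def add_lag1(series):
    """One-day-shifted production as a regressor.

    The first day has no previous value, so it becomes None and is dropped
    when the regression is fitted.
    """
    return [None] + series[:-1]

print(add_lag1([3.0, 4.5, 2.1]))  # [None, 3.0, 4.5]
```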

MODEL 2:

## 
## Call:
## lm(formula = Production ~ avgmindswrf + tavg + meancloud + lag1, 
##     data = regdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6256 -0.5620  0.0603  0.7468  3.2916 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.5581307  0.1490765  23.868   <2e-16 ***
## avgmindswrf  0.0154022  0.0008067  19.094   <2e-16 ***
## tavg         0.1236328  0.4050432   0.305     0.76    
## meancloud   -1.2764200  0.1058494 -12.059   <2e-16 ***
## lag1         0.2984571  0.0272186  10.965   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.084 on 477 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8099, Adjusted R-squared:  0.8083 
## F-statistic: 507.9 on 4 and 477 DF,  p-value: < 2.2e-16

The residual standard error decreases from 1.211 to 1.084 when I add the lag1 variable, and the adjusted R-squared increases to 0.8083, which is good. To capture the seasonality of solar energy production, I add the month of the date to the regression model.

MODEL 3:

## 
## Call:
## lm(formula = Production ~ avgmindswrf + tavg + meancloud + lag1 + 
##     month, data = regdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3798 -0.5213  0.0327  0.6479  3.2130 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.558587   0.266983   9.583  < 2e-16 ***
## avgmindswrf  0.020463   0.001558  13.132  < 2e-16 ***
## tavg        -0.161763   0.673604  -0.240   0.8103    
## meancloud   -0.963911   0.127659  -7.551  2.3e-13 ***
## lag1         0.291431   0.026721  10.907  < 2e-16 ***
## month2       0.061553   0.244074   0.252   0.8010    
## month3      -0.379166   0.292871  -1.295   0.1961    
## month4      -0.091635   0.348230  -0.263   0.7926    
## month5      -0.474394   0.446100  -1.063   0.2881    
## month6      -0.848439   0.494002  -1.717   0.0866 .  
## month7      -1.147042   0.542679  -2.114   0.0351 *  
## month8      -0.568809   0.489469  -1.162   0.2458    
## month9      -0.058723   0.430388  -0.136   0.8915    
## month10      0.632620   0.317616   1.992   0.0470 *  
## month11      0.457675   0.220076   2.080   0.0381 *  
## month12      0.319753   0.194984   1.640   0.1017    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.051 on 466 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8255, Adjusted R-squared:  0.8198 
## F-statistic: 146.9 on 15 and 466 DF,  p-value: < 2.2e-16

The residual standard error decreases to 1.051 when I add the month variable, and the adjusted R-squared increases to 0.8198, which is good. To check for a trend component, I add a trend term to the regression. After plotting the data again, I also add outlier indicators as regressors: outlierbig = 1 if production is larger than 8, outliersmall = 1 if production is smaller than 2, and both are 0 on all other days.
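Constructing the two outlier dummies is a simple thresholding of the production series; a Python sketch with the thresholds from the text (function name hypothetical):

```python
def outlier_dummies(production, high=8.0, low=2.0):
    """Indicator regressors from the text: outlierbig = 1 when production > 8,
    outliersmall = 1 when production < 2, both 0 otherwise."""
    big = [1 if p > high else 0 for p in production]
    small = [1 if p < low else 0 for p in production]
    return big, small

big, small = outlier_dummies([9.1, 5.0, 1.2])
print(big, small)  # [1, 0, 0] [0, 0, 1]
```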

Then, I added the trend, outliersmall and outlierbig components into the model.

MODEL 4:

## 
## Call:
## lm(formula = Production ~ outlierbig + outliersmall + avgmindswrf + 
##     meancloud + lag1 + (month) + trend + tavg, data = regdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3204 -0.4966  0.0439  0.5764  2.7858 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.0814569  0.2592031  11.888  < 2e-16 ***
## outlierbig    0.9683714  0.1466244   6.604 1.10e-10 ***
## outliersmall -2.1967871  0.1758801 -12.490  < 2e-16 ***
## avgmindswrf   0.0156264  0.0014443  10.819  < 2e-16 ***
## meancloud    -0.5773344  0.1127084  -5.122 4.44e-07 ***
## lag1          0.2061523  0.0236019   8.735  < 2e-16 ***
## month2       -0.0690650  0.2129010  -0.324  0.74578    
## month3       -0.5373661  0.2530248  -2.124  0.03422 *  
## month4       -0.2168668  0.2991166  -0.725  0.46880    
## month5       -0.5373911  0.3798257  -1.415  0.15779    
## month6       -0.7751676  0.4208638  -1.842  0.06614 .  
## month7       -1.1965070  0.4591396  -2.606  0.00946 ** 
## month8       -0.6839578  0.4137136  -1.653  0.09896 .  
## month9        0.0258094  0.3701903   0.070  0.94445    
## month10       0.5769282  0.2743634   2.103  0.03602 *  
## month11       0.4778200  0.1891149   2.527  0.01185 *  
## month12       0.4505080  0.1646180   2.737  0.00644 ** 
## trend         0.0008361  0.0003202   2.612  0.00931 ** 
## tavg          0.0799507  0.5710723   0.140  0.88872    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8836 on 463 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8774, Adjusted R-squared:  0.8726 
## F-statistic: 184.1 on 18 and 463 DF,  p-value: < 2.2e-16

Our model has a residual standard error of 0.8836 and an adjusted R-squared of 0.8726. However, contrary to my expectation, the average temperature has an insignificant coefficient. I remove the temperature regressor, and also the intercept, so that the significance of the coefficients can be interpreted directly.

FINAL REGRESSION MODEL:

## 
## Call:
## lm(formula = Production ~ -1 + outlierbig + outliersmall + avgmindswrf + 
##     meancloud + lag1 + (month) + trend, data = regdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3238 -0.4996  0.0453  0.5689  2.7918 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## outlierbig    0.9701250  0.1459340   6.648 8.38e-11 ***
## outliersmall -2.1949084  0.1751821 -12.529  < 2e-16 ***
## avgmindswrf   0.0156159  0.0014408  10.838  < 2e-16 ***
## meancloud    -0.5812354  0.1090944  -5.328 1.55e-07 ***
## lag1          0.2062029  0.0235742   8.747  < 2e-16 ***
## month1        3.0869607  0.2559337  12.062  < 2e-16 ***
## month2        3.0189482  0.3119652   9.677  < 2e-16 ***
## month3        2.5611894  0.3657564   7.002 8.87e-12 ***
## month4        2.8885085  0.4066435   7.103 4.61e-12 ***
## month5        2.5772804  0.4771882   5.401 1.06e-07 ***
## month6        2.3463240  0.4980230   4.711 3.26e-06 ***
## month7        1.9327715  0.5009829   3.858 0.000131 ***
## month8        2.4405904  0.4670916   5.225 2.64e-07 ***
## month9        3.1488179  0.4117534   7.647 1.19e-13 ***
## month10       3.6912579  0.3024606  12.204  < 2e-16 ***
## month11       3.5767397  0.2464815  14.511  < 2e-16 ***
## month12       3.5421502  0.2365162  14.976  < 2e-16 ***
## trend         0.0008378  0.0003196   2.621 0.009052 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8827 on 464 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9832, Adjusted R-squared:  0.9825 
## F-statistic:  1506 on 18 and 464 DF,  p-value: < 2.2e-16

Examining the coefficients of the regressors, every independent variable has a small p-value, meaning their effects are highly significant. Therefore outlierbig, outliersmall, average minimum DSWRF, mean cloud cover, lag1, month and trend are kept as regressors.

Looking at the checkresiduals output, the residual mean is around zero, the residuals are approximately normally distributed, and they are not significantly correlated except at lag 1.

I can continue with the forecasting part using the final regression model. I use the data until December 1, 2020 as training data and take the months of December 2020 and January 2021 as the test period to be estimated.

I updated the train and test data simultaneously and generated predictions for the test period with the final model.

As a result, I obtained a graph comparing the forecasted values with the actual values under the final regression model.

Above, you can see the graph of predicted vs. actual daily solar energy production.

We check the performance of each method, then continue with the method that has the lower WMAPE value when converting the daily forecasts back to hourly data.
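The two headline statistics in the tables below can be sketched in Python; the FBias sign convention shown (positive = overprediction) is one common choice and an assumption here, not taken from the report's code:

```python
def wmape(actual, forecast):
    """Weighted MAPE: total absolute error over total actual production.
    Unlike plain MAPE, it stays defined when individual actuals are zero."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / sum(actual)

def fbias(actual, forecast):
    """Forecast bias under one common convention: positive values mean the
    forecasts overpredict in aggregate."""
    return sum(f - a for a, f in zip(actual, forecast)) / sum(actual)

print(wmape([4.0, 6.0], [5.0, 5.0]))  # 0.2
print(fbias([4.0, 6.0], [5.0, 5.0]))  # 0.0
```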

Comparison of the results from Method A and Method B

Performance Testing of Method A

##    n    mean       sd     error       FBias      MAPE     MAD     WMAPE
## 1 62 4.75705 1.967251 0.2462171 -0.01099264 0.8786438 1.98657 0.4176055

Looking at the test statistics, we can say that time series decomposition with the ARIMA(2,0,3) model is not preferable. Its WMAPE is 0.418, which is not low enough.

Performance Testing of Method B

##    n    mean       sd    error     FBias     MAPE      MAD     WMAPE
## 1 62 4.75705 1.967251 0.861172 0.0420124 0.546225 1.454534 0.3057639

Looking at the test statistics, we can say that forecasting with the regression model is preferable to Method A. Its WMAPE is 0.306, which is still not low enough for the model to be considered good.

The test results of both methods are not satisfactory. I thought I could handle the zero night-time production values by converting to daily data, but this did not give a very good result. A better way to develop the model may be to predict production directly at the hourly level: separate time series and regression models could be built for each hour, which would be a good improvement.

Finally, I convert the daily forecasts to hourly using the regression method, which has the lower WMAPE of the two.

Transformation from daily data to hourly data

I collected all the December and January days in our data to find each hour's percentage contribution to the daily mean of solar energy production, separately for December and for January.

Above, you can see the actual vs. predicted hourly solar power production between 1 December 2020 and 31 January 2021. In mid-December and in January, my predictions are above the actual values.
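The daily-to-hourly conversion described above can be sketched in Python; the function and profile names are hypothetical, and the report computes one share profile per month:

```python
def daily_to_hourly(daily_mean_forecast, hourly_share):
    """Spread a daily mean forecast over the 24 hours of a day.

    hourly_share holds each hour's historical ratio to the daily mean
    (night hours are near 0), so the 24 shares average to 1 and the
    hourly values average back to the daily mean.
    """
    assert len(hourly_share) == 24
    return [daily_mean_forecast * s for s in hourly_share]

# Toy profile: all production falls in 12 daylight hours.
share = [0.0] * 12 + [2.0] * 12
hourly = daily_to_hourly(3.0, share)
print(sum(hourly) / 24)  # 3.0 -- the daily mean is preserved
```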

To develop the final prediction further, a more detailed analysis could be used to convert the daily data to hourly data. The hourly share of daily production is likely related to downward shortwave radiation flux, cloud cover and temperature, so examining the hourly behaviour of these variables could improve the model.

Conclusion

In Method A, forecasting with time series decomposition was used; in Method B, forecasting with regression analysis. Regression gave better test results for the daily predictions, so Method B was chosen for the conversion to hourly data. The relationship between the hours and daily production was analyzed, the daily forecasts were converted into hourly predictions, and the actual and predicted hourly test data were plotted. Many overpredictions are visible in the graph.

##      n    mean       sd error     FBias MAPE      MAD     WMAPE
## 1 1488 4.75705 8.160993     0 0.0420124  NaN 2.414238 0.5075074

Then, I tested the hourly forecasts against the actual values. The weighted mean absolute percentage error (WMAPE) is used to measure performance; it comes out at about 0.51, which is not adequately small. (The plain MAPE is NaN, most likely because the actual production is zero during night hours.) WMAPE can also be reinterpreted for different settings with different weights. Since FBias is close to zero, the predictions are not strongly biased toward over- or underprediction overall.

Finally, I can say that working with daily data did not lead to success; as a result, our hourly ratios were not very successful either. I offer two suggestions for improving the models and predictions in general.

1. Constructing a separate forecasting model for each of the 24 hours of the day

2. Analyzing more carefully how production is distributed across the hours of the day after establishing the daily forecast model

Code

The RMD File can be found Here.